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Abstract 

The paper presents a new sampling methodology for Bayesian networks that samples 
only a subset of variables and applies exact inference to the rest. Cutset sampling is a 
network structure-exploiting application of the Rao-Blackwellisation principle to sampling 
in Bayesian networks. It improves convergence by exploiting memory-based inference algo- 
rithms. It can also be viewed as an anytime approximation of the exact cutset-conditioning 
algorithm developed by Pearl. Cutset sampling can be implemented efficiently when the 
sampled variables constitute a loop-cutset of the Bayesian network and, more generally, 
when the induced width of the network's graph conditioned on the observed sampled vari- 
ables is bounded by a constant w. We demonstrate empirically the benefit of this scheme 
on a range of benchmarks. 



1. Introduction 

Sampling is a common method for approximate inference in Bayesian networks. When 
exact algorithms are impractical due to prohibitive time and memory demands, it is often 
the only feasible approach that offers performance guarantees. Given a Bayesian network 
over the variables X = {X\, X n }, evidence e, and a set of samples {x^} from P(X\e), an 
estimate f(X) of the expected value of function f(X) can be obtained from the generated 
samples via the ergodic average: 

E[f(X)\e]^f(X) = ^f( x(t) ^ (!) 

where T is the number of samples. f{X) can be shown to converge to the exact value as T 
increases. The central query of interest over Bayesian networks is computing the posterior 
marginals P{xi\e) for each value Xi of variable Xi, also called belief updating. For this 
query, f{X) equals a (5-function, and the above equation reduces to counting the fraction 
of occurrences of Xi = Xi in the samples, 

1 T 

P( Xi \e) = -£<5(^W) , (2) 
t=i 

where 5(xi\x^)=l iff Xi = xf^ and 5(xi\x^)=0 otherwise. Alternatively, a mixture estima- 
tor can be used, 

P{x i \e)] = ±Y j P{x i \ X V), (3) 
t=i 
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where x_\ = x^\xi. 

A significant limitation of sampling, however, is that the statistical variance increases 
when the number of variables in the network grows and therefore the number of samples 
necessary for accurate estimation increases. In this paper, we present a sampling scheme for 
Bayesian networks with discrete variables that reduces the sampling variance by sampling 
from a subset of the variables, a technique also known as collapsing or Rao-Blackwellisation. 
The fundamentals of Rao-Blackwellised sampling were developed by Casella and Robert 
(1996) and Liu, Wong, and Kong (1994) for Gibbs sampling and MacEachern, Clyde, 
and Liu (1998) and Doucet, Gordon, and Krishnamurthy (1999) for importance sampling. 
Doucet, de Freitas, Murphy, and Russell (2000) extended Rao-Blackwellisation to Particle 
Filtering in Dynamic Bayesian networks. 

The basic Rao-Blackwellisation scheme can be described as follows. Suppose we parti- 
tion the space of variables X into two subsets C and Z. Subsequently, we can re-write any 
function f(X) as f(C,Z). If we can generate samples from distribution P(C\e) and com- 
pute E[f(C,Z)\c,e], then we can perform sampling on subset C only, generating samples 
c^ 2 \ and approximating the quantity of interest by 

1 T 

E[f(C,Z)\e] = E c [E z [f(C,Z)\c,e}} « f(X) = - £ E z [f(C, Z)\c«\ e] . (4) 

t=i 

The posterior marginals' estimates for the cutset variables can be obtained using an expres- 
sion similar to Eq.(2), 

P(c l |e) = ^<5(c i |^ ) ), (5) 
t 

or using a mixture estimator similar to Eq.(3), 

P(c i \e) = ^P(c i \c%e). ( 6) 

t 

For Xi G X\C,E, E[P(Xi\e)\ = E c [P(Xi\c,e)\ and Eq.(4) becomes 

P{X l \e) = } r Y. P (xM t \e) . (7) 

t 

Since the convergence rate of Gibbs sampler is tied to the maximum correlation between 
two samples (Liu, 2001), we can expect an improvement in the convergence rate when 
sampling in a lower dimensional space since 1) some of the highly-correlated variables may be 
marginalized out and 2) the dependencies between the variables inside a smaller set are likely 
to be weaker because the variables will be farther apart and their sampling distributions will 
be smoothed out. Additionally, the estimates obtained from sampling in a lower dimensional 
space can be expected to have lower sampling variance and therefore require fewer samples 
to achieve the same accuracy of the estimates. On the other hand, the cost of generating 
each sample may increase. Indeed, the principles of Rao-Blackwellised sampling have been 
applied only in a few classes of probabilistic models with specialized structure (Kong, Liu, 
& Wong, 1994; Escobar, 1994; MacEachern, 1994; Liu, 1996; Doucet & Andrieu, 2001; 
Andrieu, de Freitas, k Doucet, 2002; Rosti & Gales, 2004). 
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The contribution of this paper is in presenting a general, structure-based scheme which 
applies the Rao-Blackwellisation principle to Bayesian networks. The idea is to exploit the 
property that conditioning on a subset of variables simplifies the network's structure, allow- 
ing efficient query processing by exact algorithms. In general, exact inference by variable 
elimination (Dechter, 1999a, 2003) or join-tree algorithms (Lauritzen k, Spiegelhalter, 1988; 
Jensen, Lauritzen, & Olesen, 1990) is time and space exponential in the induced-width w 
of the network. However, when a subset of the variables is assigned (i.e., conditioned upon) 
the induced-width of the conditioned network may be reduced. 

The idea of cutset sampling is to choose a subset of variables C such that conditioning 
on C yields a sparse enough Bayesian network having a small induced width to allow exact 
inference. Since a sample is an assignment to all cutset variables, we can efficiently generate 
a new sample over the cutset variables in the conditioned network where the computation of 
P{c\e) and P(Xi\c, e) can be bounded. In particular, if the sampling set C cuts all the cycles 
in the network (i.e., it is a loop-cutset), inference over the conditioned network becomes 
linear. In general, if C is a -w-cutset, namely a subset of nodes such that when assigned, the 
induced- width of the conditioned network is w, the time and space complexity of computing 
the next sample is 0(\C\ ■ N ■ d w+2 ) where d is the maximum domain size and N = \X\. 

The idea of exploiting properties of conditioning on a subset of variables was first pro- 
posed for exact belief updating in the context of cutset-conditioning (Pearl, 1988). This 
scheme requires enumerating all instantiations of the cutset variables. Since the number of 
instances is exponential in the size of the cutset |C|, sampling over the cutset space may 
be the right compromise when the size of the cutset is too big. Thus, sampling on a cutset 
can also be viewed as an anytime approximation of the cutset-conditioning approach. 

Although Rao-Blackwellisation in general and cutset sampling in particular can be ap- 
plied in the context of any sampling algorithm, we will introduce the principle in the con- 
text of Gibbs sampling (Geman & Geman, 1984; Gilks, Richardson, & Spiegelhalter, 1996; 
MacKay, 1996), a Markov Chain Monte Carlo sampling method for Bayesian networks. 
Extension to any other sampling approach or any other graphical models, such as Markov 
networks, should be straight forward. We recently demonstrated how the idea can be in- 
corporated into importance sampling (Bidyuk & Dechter, 2006). 

The paper defines and analyzes the cutset sampling scheme and investigates empirically 
the trade-offs between sampling and exact computation over a variety of randomly generated 
networks and grid structure networks as well as known real-life benchmarks such as CPCS 
networks and coding networks. We show that cutset sampling converges faster than pure 
sampling in terms of the number of samples, as dictated by theory, and is also almost always 
time-wise cost effective on all the benchmarks tried. We also demonstrate the applicability of 
this scheme to some deterministic networks, such as Hailfinder network and coding networks, 
where the Markov Chain is non-ergodic and Gibbs sampling does not converge. 

Section 2 provides background information. Specifically, section 2.1 introduces Bayesian 
networks, section 2.2 reviews exact inference algorithms for Bayesian networks, and sec- 
tion 2.3 provides background on Gibbs sampling. The contribution of the paper presenting 
the cutset sampling starts in section 3. Section 4 presents the empirical evaluation of cutset 
sampling. We also present an empirical evaluation of the sampling variance and the result- 
ing standard error based on the method of batch means (for more details, see Geyer, 1992). 
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In section 5, we review previous application of Rao-Blackwellisation and section 6 provides 
summary and conclusions. 

2. Background 

In this section, we define essential terminology and provide background information on 
Bayesian networks. 

2.1 Preliminaries 

We use upper case letters without subscripts, such as X, to denote sets of variables and 
lower case letters without subscripts to denote an instantiation of a group of variables (e.g., 
x indicates that each variable in set X is assigned a value). We use an upper case letter 
with a subscript, such as Xi, to denote a single variable and a lower case letter with a 
subscript, such as Xj, to denote an instantiated variable (e.g., Xj denotes an arbitrary value 
in the domain of Xi and means Xi = xi). T>{Xi) denotes the domain of the variable Xi. A 
superscript in a subscripted lower case letter would be used to distinguish different specific 
values for a variable, i.e., V(Xi) = {x\,x\, ...}. We will use x to denote an instantiation of 
a set of variables x = {x±, Xj_i, X,, Xj+i, x n } and x_j = x\xj to denote x with element 
Xi removed. Namely, x_j = {xi, X2, Xj_i, Xj+i, x n }. 

Definition 2.1 (graph concepts) A directed graph is a pair D =<V ,E> , where V = 
{X\, X n } is a set of nodes and E = {(Xi, Xj)\Xi, Xj £ V} is the set of edges. Given 
(Xi,Xj) £ E, Xi is called a parent of Xj, and Xj is called a child of Xi. The set of 
Xi's parents is denoted pa(Xi) , orpai, while the set of X-i's children is denoted ch(Xi), or 
chi. The family of Xi includes Xi and its parents. The moral graph of a directed graph 
D is the undirected graph obtained by connecting the parents of all the nodes in D and 
removing the arrows. A cycle-cutset of an undirected graph a subset of nodes that, when 
removed, yields a graph without cycles. A loop in a directed graph D is a subgraph of D 
whose underlying graph is a cycle. A directed graph is acyclic if it has no directed loops. A 
directed graph is singly-connected ( also called a poly-tree ), if its underlying undirected 
graph has no cycles. Otherwise, it is called multiply-connected. 

Definition 2.2 (loop-cutset) A vertex v is a sink with respect to a loop C if the two 
edges adjacent to v in C are directed into v. A vertex that is not a sink with respect to a 
loop C is called an allowed vertex with respect to C. A loop-cutset of a directed graph D 
is a set of vertices that contains at least one allowed vertex with respect to each loop in D. 

Definition 2.3 (Belief Networks) Let X = {X±, X n } be a set of random variables 
over multi-valued domains T>(Xi), T>(X n ). A belief network (BN) (Pearl, 1988) is 
a pair <G, P> where G is a directed acyclic graph whose nodes are the variables X and 
P = {P(Xi\pai)\i = l,...,n} is the set of conditional probability tables (CPTs) associated 
with each Xi. The BN represents a joint probability distribution having the product form: 

n 

P(x 1 ,....,x n ) = Y[P(x i \pa(X i )) 

i=i 

An evidence e is an instantiated subset of variables E C X. 
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The structure of the directed acyclic graph G reflects the dependencies between the 
variables using (^-separation criterion. The parents of a variable Xj together with its children 
and parents of its children form a Markov blanket, denoted markovi, of node X{. We 
will use x mar kovi to denote x restricted to variables in markovi. We know that the node 
Xi is independent of the rest of the variables conditioned on its Markov blanket. Namely, 

P{Xj\X—i) — P{xi | X mar k OVi ) • 

The most common query over belief networks is belief updating which is the task of 
computing the posterior distribution P(Xi\e) given evidence e and a query variable Xi G 
X. Reasoning in Bayesian networks is NP-hard (Cooper, 1990). Finding approximate 
posterior marginals with a fixed accuracy is also NP-hard (Dagum & Luby, 1993; Abdelbar 
& Hedetniemi, 1998). When the network is a poly-tree, belief updating and other inference 
tasks can be accomplished in time linear in size of the input. In general, exact inference is 
exponential in the induced width of the network's moral graph. 

Definition 2.4 (induced-width) The width of a node in an ordered undirected graph 
is the number of the node's neighbors that precede it in the ordering. The width of an 
ordering d, denoted w(d), is the maximum width over all nodes. The induced width 
of an ordered graph, w*{d), is the width of the ordered graph obtained by processing the 
nodes from last to first as follows: when node X is processed, all its preceding neighbors are 
connected. The resulting graph is called induced graph or triangulated graph. 

The task of finding the minimal induced width of a graph (over all possible orderings) is 
NP-complete (Arnborg, 1985). 

2.2 Reasoning in Bayesian Networks 

Belief propagation algorithm, which we introduce in Section 2.2.2 below, performs belief 
updating in singly-connected Bayesian networks in time linear in the size of the input 
(Pearl, 1988). In loopy networks, the two main approaches for belief updating are cutset 
conditioning and tree clustering. These algorithms are often referred to as "inference" 
algorithms. We will briefly describe the idea of clustering algorithms in Section 2.2.1 and 
the conditioning method in Section 2.2.3. 

2.2.1 Variable Elimination and Join-Tree Clustering (JTC) 

The join-tree clustering approach (JTC) refers to a family of algorithms including join- 
tree propagation (Lauritzen <fe Spiegelhalter, 1988; Jensen et al., 1990) and bucket-tree 
elimination (Dechter, 2003, 1999a). The idea is to first obtain a tree-decomposition of 
the network into clusters of functions connected as a tree and then propagate messages 
between the clusters in the tree. The tree-decomposition is a singly-connected undirected 
graph whose nodes, also called clusters, contain subsets of variables and input functions 
defined over those variables. The tree-decomposition must contain each function once and 
satisfy running intersection property (Maier, 1983). For a unifying perspective of tree- 
decomposition schemes see (Zhang Sz Poole, 1994; Dechter, 1999b; Kask, Dechter, Larrosa, 
& Dechter, 2005). 

Given a tree-decomposition of the network, the message propagation over this tree can 
be synchronized. We select any one cluster as the root of the tree and propagate messages 
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up and down the tree. A message from cluster Vi to neighbor Vj is a function over the 
separator set Vi n Vj that is a marginalization of the product of all functions in Vi and 
all messages that Vi received from its neighbors besides Vj. Assuming that the maximum 
number of variables in a cluster is w + 1 and maximum domain size is d, the time and 
space required to process one cluster is 0(d^ w+1 ^). Since the maximum number of clusters 
is bounded by |A| = N, the complexity of variable-elimination algorithms and cluster-tree 
propagation schemes is 0(N ■ d^ w+1 ^). The parameter w, the maximum cluster size minus 
1, is called the tree- width of the tree decomposition. The minimal tree- width is identical to 
the minimal induced width of a graph. 

2.2.2 Iterative Belief Propagation (IBP) 

Belief propagation (BP) is an iterative message-passing algorithm that performs exact infer- 
ence for singly-connected Bayesian networks (Pearl, 1988). In each iteration, every node Xj 
sends a iTj(Xi) message to each child j and receives a Aj(Aj) message from each child. The 
message-passing order can be organized so that it converges in two iterations. In essence the 
algorithm is the same as the join-tree clustering approach applied directly to the poly-tree. 
Applied to Bayesian networks with loops, the algorithm usually iterates longer (until it may 
converge) and hence, is known as Iterative Belief Propagation (IBP) or loopy belief prop- 
agation. IBP provides no guarantees on convergence or quality of approximate posterior 
marginals but was shown to perform well in practice (Rish, Kask, & Dechter, 1998; Murphy, 
Weiss, & Jordan, 1999). It is considered the best algorithm for inference in coding networks 
(Frey h MacKay, 1997; Kschischang h Frey, 1998) where finding the most probable variable 
values equals the decoding process (McEliece, MacKay, & Cheng, 1997). Algorithm IBP 
requires linear space and usually converges fast if it converges. In our benchmarks, IBP 
converged within 25 iterations or less (see Section 4). 

2.2.3 Cutset Conditioning 

When the tree- width w of the Bayesian network is too large and the requirements of inference 
schemes such as bucket elimination and join-tree clustering (JTC) exceed available memory, 
we can switch to the alternative cutset conditioning schemes (Pearl, 1988; Peot & Shachter, 
1992; Shachter, Andersen, & Solovitz, 1994). The idea of cutset conditioning is to select 
a subset of variables C C X\E, the cutset, and obtain posterior marginals for any node 
Xi e X\C, E using: 

P{xi\e)= P{xi\c,e)P(c\e) (8) 

cev(c) 

Eq.(8) above implies that we can enumerate all instantiations over C, perform exact infer- 
ence for each cutset instantiation c to obtain P(xi\c, e) and P(c\e) and then sum up the 
results. The total computation time is exponential in the size of the cutset because we have 
to enumerate all instantiations of the cutset variables. 

If C is a loop-cutset, then, when the nodes in C are assigned, the Bayesian network can 
be transformed into an equivalent poly-tree and P(xi\c, e) and P(c\e) can be computed via 
BP in time and space linear in the size of the network. For example, the subset {A, D} is a 
loop-cutset of the belief network shown in Figure 1, left, with evidence E = e. On the right, 
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Figure 1: When nodes A and D in the loopy Bayesian network (left) are instantiated, the 
network can be transformed into an equivalent singly-connected network (right). 
In the transformation process, a replica of an observed node is created for each 
child node. 



Figure 1 shows an equivalent singly-connected network resulting from assigning values to A 
and D. 

It is well-known that the minimum induced width w* of the network is always less than 
the size of the smallest loop-cutset (Bertele & Brioschi, 1972; Dechter, 2003). Namely, 
w* + 1 < \C\ for any C. Thus, inference approaches (e.g., bucket elimination) are never 
worse and often are better than cutset conditioning time-wise. However, when w* is too 
large we must resort to cutset conditioning search in order to trade space for time. Those 
considerations yield a hybrid search and inference approach. Since observed variables can 
break down the dependencies in the network, a network with an observed subset of variables 
C often can be transformed into an equivalent network with a smaller induced width, wc, 
which we will term the adjusted induced width. Hence, when any subset of variables C C X 
is observed, complexity is bounded exponentially by the adjusted induced width of the 
graph wc- 

Definition 2.5 (adjusted induced width) Given a graph G=<X ,E> , the adjusted 

induced width of G relative to C, denoted wc, is its induced width once C is removed 
from its moral graph. 

Definition 2.6 (ty-cutset) Given a graph G=<X,E>, a subset of nodes C C X is a 
ui-cutset of G if its adjusted induced width equals w. 

If C is a w-cutset, the quantities P(xi\c, e) and P(c\e) can be computed in time and 
space exponential in w, which can be much smaller than the tree- width of the unconditioned 
network. The resulting scheme requires memory exponential in w and time 0(d\ c \-N-d( w+i y) 
where N is the size of the network and d is the maximum domain size. Thus, the performance 
can be tuned to the available system memory resource via the bounding parameter w. 

Given a constant w, finding a minimal w-cutset (to minimize the cutset conditioning 
time) is also a hard problem. Several greedy heuristic approaches have been proposed by 
Geiger and Fishelson (2003) and by Bidyuk and Dechter (2003, 2004). We elaborate more 
in Section 3.5. 
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2.3 Gibbs Sampling 

Since the complexity of inference algorithms is memory exponential in the network's induced 
width (or tree-width) and since resorting to the cutset-conditioning scheme may take too 
much time when the ^-cutset size is too large, we must often resort to approximation 
methods. Sampling methods for Bayesian networks are commonly used approximation 
techniques. This section provides background on Gibbs sampling, a Markov Chain Monte 
Carlo method, which is one of the most popular sampling schemes and is the focus of this 
paper. Although the method may be applied to the networks with continuous distributions, 
we limit our attention in this paper to discrete random variables with finite domains. 

2.3.1 Gibbs Sampling for Bayesian Networks 



Ordered Gibbs Sampler 

Input: A belief network B over X={X±, ...,X n } and evidence e={(Xi = e{)\Xi G E C X}. 
Output: A set of samples {x®}, t = 1...T. 

1. Initialize: Assign random value xf^ to each variable Xi G X\E from PpQ). Assign 
evidence variables their observed values. 

2. Generate samples: 

For t = 1 to T, generate a new sample x®: 

For i = 1 to N, compute a new value xf^ for variable Xf. 

Compute distribution P(*i|a^ rfaw .) and sample xf <- P(Xi\x^ arkoVi ). 

Set Xi = xf\ 
End For i 

End For t 

Figure 2: A Gibbs sampling Algorithm 

Given a Bayesian network over the variables X = {Xi, ...,X n }, and evidence e, Gibbs 
sampling (Geman & Geman, 1984; Gilks et al., 1996; MacKay, 1996) generates a set of 
samples {x^} where each sample = {x± \ x$} is an instantiation of all the variables. 
The superscript t denotes a sample index and xf ^ is the value of Xi in sample t. The first 
sample can be initialized at random. When generating a new sample from sample xf\ a 
new value for variable Xi is sampled from probability distribution P{Xi\x < f} i ) (recall that 
P(Xi\x^\) = P(Xi\xf +1 \ ...,x$)) which we will denote as Xi <— P(Xi\x^\). 

The next sample xf +1 ^ is generated from the previous sample xf 1 following one of two 
schemes. 

Random Scan Gibbs Sampling. Given a sample at iteration t, pick a variable 
Xi at random and sample a new value Xi from the conditional distribution xi <— P(Xi\x^\) 
leaving other variables unchanged. 

Systematic Scan (Ordered) Gibbs Sampling. Given a sample x^\ sample a new 
value for each variable in some order: 

X! <- P(X 1 \x^,x^,...,x^ ) ) 
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x 2 «- P(X 2 \4 +1 \x<i\..., X M) 

x- <- PfXlx (m) x (m) x W x^l 

X« < P[X n \x^ \ %2 ^ i 1^) 

In Bayesian networks, the conditional distribution P(JQ|a^]) is dependent only on the 

assignment to the Markov blanket of variable X^. Thus, P{Xi\x^ i )=P(Xi\x^ arkm] ) where 

x markov ls ^ ne restriction of x^ to markovi. Given a Markov blanket of Xj, the sampling 
probability distribution is given explicitly by Pearl (1988): 

P(xi\x^ arkoVi ) = aP( Xi \x^) J] P{xf\x%) (9) 

Thus, generating a complete new sample can be done in 0(n ■ r) multiplication steps 
where r is the maximum family size and n is the number of variables. 

The sequence of samples x^ l \ x^ 2 \ ... can be viewed as a sequence of states in a Markov 
chain. The transition probability from state {xf +1 \ xf^\ xf\ xf^_ ± , x$} to state 

{x^ +1 \...,xf^\xf +1 \x^ l ,...,x$} is defined by the sampling distribution P(Xi\x^\). 
By construction, a Markov chain induced by Gibbs sampling has an invariant distribution 
P(X\e). However, since the values assigned by the Gibbs sampler to variables in a sample 
depend on the assignment of values in the previous sample it follows that the 
sample x^ depends on the initial state x^ ^. The convergence of the Markov chain is 
defined by the rate at which the distance between the distribution P(x ( - n ^\x < ^°\ e) and the 
stationary distribution P(X\e) (i.e., variational distance, L 1 -distance, or y 2 ) converges to 
as a function of n. Intuitively, it reflects how quickly the inital state x^ can be "forgotten." 
The convergence is guaranteed as T — > oo if Markov chain is ergodic (Pearl, 1988; Gelfand 
& Smith, 1990; MacKay, 1996). A Markov chain with a finite number of states is ergodic if 
it is aperiodic and irreducible (Liu, 2001). A Markov chain is aperiodic if it does not have 
regular loops. A Markov chain is irreducible if we can get from any state Si to any state 
Sj (including Si) with non-zero probability in a finite number of steps. The irreducibility 
guarantees that we will be able to visit (as number of samples increases) all statistically 
important regions of the state space. In Bayesian networks, the conditions are almost always 
satisfied as long as all conditional probabilities are positive (Tierney, 1994). 

To ensure that the collected samples are drawn from distribution close to P(X\e), a 
"burn-in" time may be allocated. Namely, assuming that it takes « K samples for the 
Markov chain to get close to the stationary distribution, the first K samples may not be in- 
cluded into the computation of posterior marginals. However, determining K is hard (Jones 
& Hobert, 2001). In general, the "burn-in" is optional in the sense that the convergence of 
the estimates to the correct posterior marginals does not depend on it. For completeness 
sake, the algorithm is given in Figure 2. 

When convergence conditions are satisfied, an ergodic average Jt(X) = y Ylt f(x t ) for 
any function f(X) is guaranteed to converge to the expected value E[f(X)] as T increases. 
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In other words, \fr(X) — E[f(X)]\ — ► as T — ► oo. For a finite-state Markov chain that is 
irreducible and aperiodic, the following result applies (see Liu, 2001, Theorem 12.7.2): 

Vf\f T (X) - E[f(X)}\ ^ N(0,a(ff) (10) 

for any initial assignment to x^°\ The variance term cr(/) 2 is defined as follows: 

a(f) 2 = 2r(/)a 2 

where a 2 = var[f(X)] and r(h) is the integrated autocorrelation time. 

Our focus is on computing the posterior marginals P(Xi\e) for each X^X\E. The 
posterior marginals can be estimated using either a histogram estimator. 

1 T 

P(X i = x i \e) = -J2S(x i \x {t) ) (11) 
t=i 

or a mixture estimator: 

1 T 

P{X i = x i \e) = -Y J P{xi\x { % (12) 
t=i 

The histogram estimator corresponds to counting samples where Xj = Xj, namely S(xi\x^) = 
1 if xf^ = xi and equals otherwise. Gelfand and Smith (1990) have pointed out that since 
mixture estimator is based on estimating conditional expectation, its sampling variance 
is smaller due to Rao-Blackwell theorem. Thus, mixture estimator should be preferred. 
Since P{xi\x^ i ) = P{xi\x^ arkov ), the mixture estimator is simply an average of conditional 
probabilities: 

1 T 

P(xi\e) = -J2P(xi\x^ arkoVi ) (13) 
t=i 

As mentioned above, when the Markov chain is ergodic, P(Xi\e) will converge to the exact 
posterior marginal P(Xi\e) as the number of samples increases. It was shown by Roberts and 
Sahu (1997) that random scan Gibbs sampler can be expected to converge faster than the 
systematic scan Gibbs sampler. Ultimately, the convergence rate of Gibbs sampler depends 
on the correlation between two consecutive samples (Liu, 1991; Schervish & Carlin, 1992; 
Liu et al., 1994). We review this subject in the next section. 

2.4 Variance Reduction Schemes 

The convergence rate of the Gibbs sampler depends on the strength of the correlations 
between the samples (which are also the states of the Markov chain). The term correlation 
is used here to mean that the samples are dependent, as mentioned earlier. In the case of a 
finite-state irreducible and aperiodic Markov chain, the convergence rate can be expressed 
through maximal correlation between states x^ and x^ (see Liu, 2001, ch. 12). In practice, 
the convergence rate can be analyzed through covariance cov [/(x^), f(x^ t+1 ^)], where / is 
some function, also called auto-covariance. 

The convergence of the estimates to the exact values depends on both the convergence 
rate of the Markov chain to the stationary distribution and the variance of the estimator. 
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Both of these factors contribute to the value of the term cr(/) 2 in Eq.(lO). The two main 
approaches that allow to reduce correlation between samples and reduce sampling variance 
of the estimates are blocking (grouping variables together and sampling simultaneously) 
and collapsing (integrating out some of the random variables and sampling a subset), also 
known as Rao-Blackwellisation. 

Given a joint probability distribution over three random variables X, Y, and Z, we can 
depict the essence of those three sampling schemes as follows: 



1. Standard Gibbs: 



x (t+i) _ P{X\y®,z®) (14) 
y (t+i) ^ P{Y\x^ t+1 \z^) (15) 
z (t+i) ^ P(Z\x^ t+1 \y^) (16) 



2. Collapsed (variable Z is integrated out): 



<_ P(X\y®) (17) 
<- P(Y\x^ t+1) ) (18) 



3. Blocking by grouping X and Y together: 



(a^+^j/t+i)) <_ P( X ,Y\z®) (19) 
z (*+i) <_ P(Z\x^ t+l \y^) (20) 



The blocking reduces the correlation between samples by grouping highly correlated 
variables into "blocks." In collapsing, the highly correlated variables are marginalized out, 
which also results in the smoothing of the sampling distributions of the remaining variables 
(P(Y\x) is smoother than P(Y\x,z)). Both approaches lead to reduction of the sampling 
variance of the estimates, speeding up their convergence to the exact values. 

Generally, blocking Gibbs sampling is expected to converge faster than standard Gibbs 
sampler (Liu et al., 1994; Roberts & Sahu, 1997). Variations on this scheme have been 
investigated by Jensen et al. (1995) and Kjaerulff (1995). Given the same number of samples, 
the estimate resulting from collapsed Gibbs sampler is expected to have lower variance 
(converge faster) than the estimate obtained from blocking Gibbs sampler (Liu et al., 1994). 
Thus, collapsing is preferred to blocking. The analysis of the collapsed Gibbs sampler can 
be found in Escobar (1994), MacEachern (1994), and Liu (1994, 1996). 

The caveat in the utilization of the collapsed Gibbs sampler is that computation of the 
probabilities P(X\y) and P(Y\x) must be efficient time-wise. In case of Bayesian networks, 
the task of integrating out some variables is equivalent to posterior belief updating where 
evidence variables and sampling variables are observed. Its time complexity is therefore 
exponential in the adjusted induced width, namely, in the effective width of the network 
after some dependencies are broken by instantiated variables (evidence and sampled). 
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2.5 Importance Sampling 

Since sampling from the target distribution is hard, different sampling methods explore 
different trade-offs in generating samples and obtaining estimates. As we already discussed, 
Gibbs sampling generates dependent samples but guarantees convergence of the sampling 
distribution to the target distribution. Alternative approach, called importance sampling, 
is to generate samples from a sampling distribution Q(X) that is different from P(X\e) and 
include the weight = P(x^\e)/Q(x^) of each sample x® in the computation of the 
estimates as follows: 



The convergence of /t(X) to E[f(X)} is guaranteed as long as the condition P(x\e) ^ 
=> Q(x) 7^ holds. The convergence speed depends on the distance between Q(X) and 



One of the simplest forms of importance sampling for Bayesian networks is likelihood 
weighting (Fung & Chang, 1989; Shachter & Peot, 1989) which processes variables in topo- 
logical order, sampling root variables from their priors and the remaining variables from 
conditional distribution P{Xi\pa,i) defined by their conditional probability table (the evi- 
dence variables are assigned their observed values). Its sampling distribution is close to the 
prior and, as a result, it usually converges slowly when the evidence is concentrated around 
the leaf nodes (nodes without children) and when the probability of evidence is small. Adap- 
tive (also called dynamic) importance sampling is a method that attempts to speed up the 
convergence by updating the sampling distribution based on the weight of previously gen- 
erated samples. Adaptive importance sampling methods include self-importance sampling, 
heuristic importance sampling (Shachter & Peot, 1989), and, more recently, AIS-BN (Cheng 
& Druzdzel, 2000) and EPIS-BN (Yuan & Druzdzel, 2003). In the empirical section, we 
compare the performance of the proposed cutset sampling algorithm with AIS-BN which 
is considered a state-of-the-art importance sampling algorithm to date (although EPIS-BN 
was shown to perform better in some networks) and, hence, describe AIS-BN here in more 
detail. 

AIS-BN algorithm is based on the observation that if we could sample each node in 
topological order from distribution P(Xi\pai, e), then the resulting sample would be drawn 
from the target distribution P(X\e). Since this distribution is unknown for any variable 
that has observed descendants, AIS-BN initializes the sampling distributions P°(Xi\pcii, e) 
equal to either P(Xi\pa,i) or a uniform distribution and then updates each distribution 
P k (Xi\pai, e) every / samples so that the next sampling distribution P k+1 (Xi\pai, e) will be 
closer to P(Xi\pai,e) than P k (Xi\pa{, e) as follows: 



where T](k) is a positive function that determines the learning rate and P'(xi\pai,e) is an 
estimate of P(xi\pai,e) based on the last / samples. 




(21) 



t=i 



P(X\e). 



P k+1 (xi\pai, e) = P k (xi\pai, e) + r)(k) ■ (P'(xi\pcn, e) - P k {xi\pa u e)) 
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3. Cutset Sampling 

This section presents the cutset sampling scheme. As we discussed above, sampling on a 
cutset is guaranteed to be more statistically efficient. Cutset sampling scheme is a compu- 
tationally efficient way of sampling from a "collapsed" variable subset C C X, tying the 
complexity of sample generation to the structure of the Bayesian network. 

3.1 Cutset Sampling Algorithm 

The cutset sampling scheme partitions the variable set X into two subsets C and X\C. 
The objective is to generate samples from space C={C\, C2, • C m } where each sample 
is an instantiation of all the variables in C. Following the Gibbs sampling principles, we 
wish to generate a new sample by sampling a value cf^ from the probability distribution 
P{d\c { % = P(d\c( +l \4 +1 \ c { ^\ cjjj, c#). We will use left arrow to denote that 
value Cj is drawn from distribution P(Ci\c^\): 

c t ^P(Ci\c%e) (22) 

If we can compute the probability distribution P(Ci\c^\, e) efficiently for each sampling 
variable Cj G C, then we can generate samples efficiently. The relevant conditional distri- 
butions can be computed by exact inference whose complexity is tied to the network struc- 
ture. We denote by JTC(B,Xi,e) a generic algorithm in the class of variable-elimination 
or join-tree clustering algorithms which, given a belief network B and evidence e, outputs 
the posterior probabilities P{Xi\e) for variable X{ G X (Lauritzen &: Spiegelhalter, 1988; 
Jensen et al., 1990; Dechter, 1999a). When the network's identity is clear, we will use the 
notation JTC(Xi,e). 



Cutset Sampling 




Input: A belief network B, a cutset C = {C\, C m }, evidence e. 




Output: A set of samples c t ,t = \...T. 




1. Initialize: Assign random value c° to each Cj G C and assign e. 




2. Generate samples: 




For t = to T-l, generate a new sample c( t+1 ) as follows: 




For i = 1 to m, compute new value cf^ for variable Cj as follows: 




a. Compute JTC(d,c%e). 




b. Compute P(Ci\c%e) = aP(d, c% e). 




-Sample: c f+» ^ P ( Ci \ c % e) 


(23) 


End For i 




End For t 





Figure 3: w-Cutset sampling Algorithm 



Therefore, for each sampling variable Cj and for each value Cj G T>(Ci), we can compute 
P(d,C-\,e) via JTC(Ci,c^\,e) and obtain P(Ci\c^\, e) via normalization: P(Ci\c^\,e) = 
aP(d,c%e). 
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Cutset sampling algorithm that uses systematic scan Gibbs sampler is given in Figure 3. 
Clearly, it can be adapted to be used with the random scan Gibbs sampler as well. Steps 
(a)-(c) generate sample (t + 1) from sample (t). For every variable d £ C in sequence, the 
main computation is in step (a), where the distribution P(Ci,c^\,e) over Cj is generated. 
This requires executing JTC for every value Cj € V(d), separately. In step (b), the 
conditional distribution is derived by normalization. Finally, step (c) samples a new value 
from the obtained distribution. Note that we only use P(Cj|c^, e) as a short-hand notation 

for P(Ci\cf +1 \ cf^\ cf\, c£\ e). Namely, when we sample a new value for variable 
Cj, the values of variables C\ through Cj_i have already been updated. 

We will next demonstrate the process using the special case of loop-cutset (see Defini- 
tion 2.1). 

Example 3.1 Consider the belief network previously shown in Figure 1 with the observed node 
E = e and loop-cutset {A,D}. We begin the sampling process by initializing sampling variables to 
a' ) and d^°K Next, we compute new sample values a^\d^ as follows: 



P(A\S°\e) = aP JTC (A,cW,e) (24) 

a (1) <- P(A\d (0) ,e) (25) 

P(D\a (1 \e) = aP JTC (D,a^,e) (26) 

d« <- P{D\a {1 \e) (27) 



The process above corresponds to two iterations of the inner loop in Figure 3. Eq. (24)-(25), where 
we sample a new value for variable A, correspond to steps (a) -(c) of the first iteration. In the second 
iteration, Eq.(26)-(27), we sample a new value for variable D. Since the conditioned network is a 
poly-tree (Figure 1, right), computing probabilities PjTc(A\d^\e) and PjTc{D\a^ +1 \e) via JTC 
reduces to Pearl's belief propagation algorithm and the distributions can be computed in linear time. 

3.2 Estimating Posterior Marginals 

Once a set of samples over a subset of variables C is generated, we can estimate the posterior 
marginals of any variable in the network using mixture estimator. For sampling variables, 
the estimator takes the form similar to Eq.(12): 

P{C i \e) = ^Y J P{C i \c%e) (28) 
t=i 

For variables in X\C, E, the posterior marginal estimator is: 

1 T 

P{X i \e) = -Y d P{X i \S,e) (29) 
t=i 

We can use JTC(Xi, e) to obtain the distribution P(Xi\c^\ e) over the input Bayesian 
network conditioned on c^-* and e as shown before. 

If we maintain a running sum of the computed distributions P(Ci\c^\, e) and P(Xi\c^ , e) 
during sample generation, the sums in the right hand side of Eq.(28)-(29) will be readily 
available. As we noted before, the estimators P(Ci\e) and P{Xi\e) are guaranteed to con- 
verge to their corresponding exact posterior marginals as T increases as long as the Markov 
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chain over the cutset C is ergodic. While for the cutset variables the estimator is a simple 
ergodic average, for X; L G X\C, E the convergence can also be derived directly from first 
principles: 

Theorem 3.2 Given a Bayesian network B over X , evidence variables E C X , and cutset 
C C X\E, and given a set of T samples c^ 2 \ obtained via Gibbs sampling from 

P(C\e), and assuming the Markov chain corresponding to sampling from C is ergodic, then 
for any X, G X\C,E assuming P(Xi\E) is defined by Eq.(29), PpQ|e) — > PpQ|e) as 
T -> oo . 

Proof. By definition: 

P(^| e ) = i^P(X l | C W,e) (30) 
t=\ 

Instead of summing over samples, we can rewrite the expression above to sum over all 
possible tuples c £ T>(C) and group together the samples corresponding to the same tuple 
instance c. Let q(c) denote the number of times a tuple C = c occurs in the set of samples 
so that J2 c ev(c) ^( c ) = ^ ■ It is easy to see that: 

P(X i \e)= £ ppQ| c , e )^ (31) 
cev(c) 

The fraction is a histogram estimator for the posterior marginal P(c|e). Thus, we get: 

P(X i \e)= P(Xi\c,e)P{c\e) (32) 

cev(c) 

Since the Markov chain formed by samples from C is ergodic, P(c|e) 
and therefore: 

P(X t \e)^ P(X i \c,e)P{c\e) = P(X l \e) 

cev(c) 

□ 



3.3 Complexity 

The time and space complexity of generating samples and estimating the posterior marginals 
via cutset sampling is dominated by the complexity of JTC in line (a) of the algorithm 
(Figure 3). Only linear amount of additional memory is required to maintain the running 
sums of P(Ci\c^\, e) and P(Xi\c^\ e) used in the posterior marginal estimators. 

3.3.1 Sample Generation Complexity 

Clearly, when JTC is applied to the network B conditioned on all the cutset variables C 
and evidence variables E, its complexity is time and space exponential in the induced width 
w of the conditioned network. It is 0(N ■ d( w+1 ^) when C is a tu-cutset (see Definition 2.6). 



-► P(c|e) as T -> oo 
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Using the notion of a iw-cutset, we can balance sampling and exact inference. At one end 
of the spectrum we have plain Gibbs sampling where sample generation is fast, requiring 
linear space, but may have high variance. At the other end, we have exact algorithm 
requiring time and space exponential in the induced width of the moral graph. In between 
these two extremes, we can control the time and space complexity using w as follows. 

Theorem 3.3 (Complexity of sample generation) Given a network B over X , evi- 
dence E, and a w-cutset C , the complexity of generating a new sample is time and space 
0{\C\ ■ N ■ d( w+2 ^) where d bounds the variable's domain size and N = \X\. 

Proof. If C is a w/-cutset and d is the maximum domain size, then the complexity of 
computing joint probability P(ci,c^\,e) over the conditioned network is 0(N • d^ w+1 ^). 
Since this operation must be repeated for each a G V(Ci), the complexity of processing one 
variable (computing distribution P(d\c%e)) is 0(N ■ d ■ d^ w+1 ^) = 0(N ■ d^ w+2 ^>). Finally, 
since ordered Gibbs sampling requires sampling each variable in the cutset, generating one 
sample is 0(\C\ -N-d^+V). □ 



3.3.2 Complexity of Estimator Computation 

The posterior marginals for any cutset variable Cj G C are easily obtained at the end of 
sampling process without incurring additional computation overhead. As mentioned earlier, 
we only need to maintain a running sum of probabilities P(cj|c^],e) for each a G T>(Ci). 
Estimating P(Aj|e), Xi G X\C,E, using Eq.(29) requires computing P(Aj|c^,e) once a 
sample is generated. In summary: 

Theorem 3.4 (Computing Marginals) Given a w-cutset C, the complexity of comput- 
ing posteriors for all variables Xi G X\E using T samples over the cutset variables is 
0(T ■ [\C\ + d] ■ N ■ d( w+1 ^). 

Proof. As we showed in Theorem 3.3, the complexity of generating one sample is 0(\C\ ■ 
N ■ d^ w+2 ^). Once a sample is generated, the computation of the posterior marginals 
for the remaining variables requires computing P{Xi\c <yt \e) via JTC{Xi,c <yt \e) which is 
0(N ■ d( w+1 ^). The combined computation time for one sample is 0(\C\ ■ N ■ d( w+ ^ + 
N ■ = 0([\C\ + d] ■ N ■ d( w+1 1). Repeating the computation for T samples, yields 

0(T- [|C| +d] • N □ 

Note that the space complexity of tu-cutset sampling is bounded by 0(N ■ d^ w+1 ^). 

3.3.3 Complexity of Loop-Cutset 

When the cutset C is a loop-cutset, algorithm JTC reduces to belief propagation (Pearl, 
1988) that computes the joint distribution P(Cj, c^], e) in linear time. We will refer to the 
special case as loop-cutset sampling and to the general as w-cutset sampling. 

A loop-cutset is also a ^-cutset where w equals the maximum number of unobserved 
parents (upper bounded by the maximum indegree of a node). However, since processing 
poly-trees is linear even for large w, the induced width does not capture its complexity 
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properly. The notion of loop-cutset could be better captured via the hyperwidth of the 
network (Gottlob, Leone, &: Scarello, 1999; Kask et al., 2005). The hyperwidth of a poly- 
tree is 1 and therefore, a loop-cutset can be defined as a 1-hypercutset. Alternatively, we 
can express the complexity via the network's input size M which captures the total size of 
conditional probability tables to be processed as follows: 

Theorem 3.5 (Complexity of loop-cutset sample generation) If C is a loop-cutset, 
the complexity of generating each sample is 0(\C\ • al • M) where M is the size of the input 
network. 

Proof. When a loop-cutset of a network is instantiated, belief propagation (BP) can 
compute the joint probability P(cj,c^,e) in linear time O(M) (Pearl, 1988) yielding total 
time and space of 0(\C\ ■ d ■ M) for each sample. □ 



3.4 Optimizing Cutset Sampling Performance 

Our analysis of the complexity of generating samples (Theorem 3.3) is overly pessimistic in 
assuming that the computation of the sampling distribution for each variable in the cutset 
is independent. While all variables may change a value when moving from one sample to 
the next, the change occurs one variable at a time in some sequence so that much of the 
computation can be retained when moving from one variable to the next . 

We will now show that sampling all the cutset variables can be done more efficiently 
reducing the factor of N ■ \ C\ in Theorem 3.3 to {N + \C\ -5) where 5 bounds the number of 
clusters in the tree decomposition used by JTC that contains any node Ci G C. We assume 
that we can control the order by which cutset variables are sampled. 




Figure 4: A Bayesian network (top) and corresponding cluster-tree (bottom). 

Consider a simple network with variables X={X±, ....X n }, Y={Y±, Y n -±} and CPTs 
P(Xi + i\ Xi, Yi) and P(Yi + i\Xi) defined for every i as shown in Figure 4, top. The join-tree 
of this network is a chain of cliques of size 3 given in Figure 4, bottom. Since Y is a loop- 
cutset, we will sample variables in Y . Let's assume that we use the ordering Yi, Y2, ...Y ri _i to 
generate a sample. Given the current sample, we are ready to generate the next sample by 
applying JTC (or bucket-elimination) to the network whose cutset variables are assigned. 



17 



BlDYUK & DECHTER 



This makes the network effectively singly-connected and leaves only 2 actual variables in 
each cluster. The algorithm sends a message from the cluster containing X n towards the 
cluster containing X\. When cluster (Xi,X2,Y±) gets the relevant message from cluster 
(X2, X3, Y2) we can sample Y\. This can be accomplished by d linear computations in clique 
(X\,X2,Y\) for each m G V{Yi) yielding the desired distribution P(Y\\.) (we can multiply 
all functions and incoming messages in this cluster, sum out X\ and X2 and normalize). If 
the cutset is a ^-cutset, each computation in a single clique is 0(d^ w+1 ^). 

Once we have P(Yi\-), Y\ is sampled and assigned a new value, y±. Cluster (X\,X2, Y\ = 
yi) then sends a message to cluster (X2, X3, Y2) which now has all the information necessary 
to compute P(Y 2 \.) in 0(d (w+2) ). Once P{Y 2 \.) is available, a new value Y2 = 7/2 is sampled. 
The cluster than computes and sends a message to cluster (X3, X4,Y^), and so on. At the 
end, we obtain a full sample via two message passes over the conditioned network having 
computation complexity of 0(N ■ d^^). This example can be generalized as follows. 

Theorem 3.6 Given a Bayesian network having N variables, a w-cutset C , a tree- decomposition 
T r , and given a sample c\, c\ c \, a new sample can be generated in 0((N+\C\ ■ S) ■ d^ 2 )) 
where 5 is the maximum number of clusters containing any variable C{ G C . 

Proof. Given tt;-cutset C, by definition, there exists a tree-decomposition T r of the network 
(that includes the cutset variables) such that when the cutset variables C are removed, the 
number of variables remaining in each cluster of T r is bounded by w + 1. Let's impose 
directionality on T r starting at an arbitrary cluster that we call R as shown in Figure 5. Let 
Tc i denote the connected subtree of T r whose clusters include Cj. In Figure 5, for clarity, 
we collapse the subtree over Cj into a single node. We will assume that cutset nodes are 
sampled in depth-first traversal order dictated by the cluster tree rooted in R. 




Figure 5: A cluster-tree rooted in cluster R where a subtree over each cutset node C\ is 
collapsed into a single node marked Tc 4 . 
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Given a sample JTC will send messages from leaves of T r towards the root cluster. 
We can assume without loss of generality that R contains cutset node C\ which is the first to 
be sampled in c^ t+1 \ JTC will now pass messages from root down only to clusters restricted 
to Tc 1 (note that R £ ?b 1 ). Based on these messages P{C\ = c\,c^\) can be computed 
in 0(S W+1 ^). We will repeat this computation for each other value of C\ involving only 
clusters in Tc l and obtain the distribution P(C\\-) in 0(S W+2 ^) and sample a new value 
for C\. Thus, if C\ appears in 5 clusters, the number of message passing computations 
(after the initial O(N) pass) is 0(5) and we can generate the first distribution P(C±\-) in 
0(5 • S w+ V). 

The next node in the depth-first traversal order is Tq 2 and thus, the second variable to 
be sampled is C2. The distance between variables C\ and C2, denoted disti^, is the shortest 
path along T r from a cluster that contains C\ to a cluster that contains Ci- We apply JTC's 
mesage-passing along that path only which will take at most 0(dist\^ • d^ w+1 ^). Then, to 
obtain the conditional distribution P(C2\-), we will recompute messages in the subtree of 
Tc 2 for each value C2 £ D(C<i) in 0(5 ■ d^ w+2 ^). We continue the computation in a similar 
manner for other cutset nodes. 

If JTC traverses the tree in the depth-first order, it only needs to pass messages along 
each edge twice (see Figure 5). Thus, the sum of all distances traveled is ^=2 distij-i = 
O(N). What may be repeated is the computation for each value of the sampled variable. 
This, however, can be accomplished via message-passing restricted to individual variables' 
subtrees and is bounded by its 5. We can conclude that a new full sample can be generated 
in 0((N+ \C\ -5) •d( lu + 2 )). □ 



It is worthwhile noting that the complexity of generating a sample can be further reduced 
by a factor of dj (d— 1) (which amounts to a factor of 2 when d = 2) by noticing that whenever 

we move from variable Ci to Cj+i, the joint probability P(cf +1 \ cf +1 \ cfh, cj^) is 
already available from the previous round and should not be recomputed. We only need 
to compute P(cf +1 \ cf +1 \ Cj+i, cj^) for Cj+i / c f+v Buffering the last computed 
joint probability, we only need to apply JTC algorithm d — 1 times. Therefore, the total 
complexity of generating a new sample is 0((N + \C\ ■ 5) ■ (d — 1) • d^ w+1 ^). 

Example 3.7 Figure 6 demonstrates the application of the enhancements discussed. It 
depicts the moral graph (a), already triangulated, and the corresponding join-tree (b) for the 
Bayesian network in Figure 1. With evidence variable E removed, variables B and D form 
a 1-cutset. The join-tree of the network with cutset and evidence variables removed is shown 
in Figure 6 (c). Since removing D and E from cluster DFE leaves only one variable, F, 
we combine clusters BDF and DFE into one cluster, FG. Assuming that cutset variables 
have domains of size 2, we can initialize B = b° and D = dP. 

Selecting cluster AC as the root of the tree, JTC first propagates messages from leaves 
to the root as shown in Figure 6 (c) and then computes P(b°,d°,e) in cluster AC. Next, we 
set B = b 1 ; updating all functions containing variable B, and propagating messages through 
the subtree of B consisting of clusters AC and CF (Figure 6 (d)), we obtain P(6 1 ,d°,e). 
Normalizing the two joint probabilities, we obtain distribution P(B\d°, e) and sample a new 
value of B. Assume we sampled value b 1 . 
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Figure 6: A join-tree of width 2 (b) for a moral graph (a) is transformed into a join-tree of 
width 1 (c) when evidence variable E and cutset variables B and D are instanti- 
ated (in the process, clusters BDF and BCF are merged into cluster CF). The 
clusters contain variables and functions from the original network. The cutset 
nodes have domains of size 2, D(B) = {6°, b 1 }, T>(D) = {dP^d 1 }. Starting with a 
sample {b°,d }, messages are propagated in (c)-(e) to first, sample a new value 
of variable B (d) and then variable D (e) . Then messages are propagated up the 
tree to compute posterior marginals P(-|& 1 , d°, e) for the rest of the variables (f). 



Next, we need to compute P(D\b 1 ,e) to sample a new value for variable D. The joint 
probability P(d°, b 1 , e) is readily available since it was computed for sampling a new value of 
B. Thus, we set D = d 1 and compute the second probability P(d 1 1 b 1 , e) updating functions 
in clusters CF and FG and sending an updated message from CF to FG (Figure 6 (e)). 
We obtain distribution P(D\b 1 ,e) by normalizing the joint probabilities and sample a new 
value d? for D. Since the value has changed from latest computation, we update again 
functions in the clusters CF and FG and propagate updated messages in the subtree Cd 
(send message from CF to FG). 

In order to obtain the distributions P(-|6 1 , d°, e) for the remaining variables A, C, F, 
and G, we only need to send updated messages up the join-tree, from FG to CF and then 
from CF to AC as shown in Figure 6 (f). The last step also serves as the initialization 
step for the next sample generation. 

In this example the performance of cutset sampling is significantly better than its worst 
case. We have sent a total of 5 messages to generate a new sample while the worst case 
suggests at least N • \ C\ • d = 3 • 2 • 2 = 12 messages (here, N equals the number of clusters). 
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3.5 On finding A u>-Cutset 

Clearly, u>-cutset sampling will be effective only when the w-cutset is small. This calls for 
the task of finding a minimum size w;-cutset. The problem is NP-hard; yet, several heuristic 
algorithms have been proposed. We next briefly survey some of those proposals. 

Larossa and Dechter (2003) obtain w-cutset when processing variables in the elimination 
order. The next node to be eliminated (selected using some triangulation heuristics) is added 
to the cutset if its current induced width (or degree) is greater than w. Geiger and Fishelson 
(2003) agument this idea with various heuristics. 

Bidyuk and Dechter (2003) select the variables to be included in the cutset using greedy 
heuristics based on the node's basic graph properties (such as the degree of a node). One 
scheme starts from an empty w-cutset and then heuristically adds nodes to the cutset until 
a tree-decomposition of width < w can be obtained. The other scheme starts from a set 
C = X\E containing all nodes in the network as a cutset and then removes nodes from the 
set in some order. The algorithm stops when removing the next node would result in a tree 
decomposition of width > w. 

Alternatively, Bidyuk and Dechter (2004) proposed to first obtain a tree-decomposition 
of the network and then find the minimal u>-cutset of the tree-decomposition (also an NP- 
hard problem) via a well-known greedy algorithm used for set cover problem. This approach 
is shown to yield a smaller cutset than previously proposed heuristics and is used for finding 
ty-cutset in our experiments (section 4.4) with a modification that a tree-decomposition is 
re-computed each time a node is removed from the tree and added to the w-cutset. 

4. Experiments 

In this section, we present empirical studies of cutset sampling algorithms for several classes 
of problems. We use the mean square error of the posterior marginals' estimates as a 
measure of accuracy. We compare with traditional Gibbs sampling, likelihood weighting 
(Fung & Chang, 1989; Shachter & Peot, 1989), and the state of the art AIS-BN adaptive 
importance sampling algorithm (Cheng & Druzdzel, 2000). We implemented AIS-BN using 
the parameters specified by Cheng and Druzdzel (2000). By using our own implementation, 
we made sure that all sampling algorithms used the same data access routines and the 
same error measures providing a uniform framework for comparing their performance. For 
reference we also report the performance of Iterative Belief Propagation (IBP) algorithm. 

4.1 Methodology 

In this section we detail describe methodology used and the implementation decisions made 
that apply to the collection of the empirical results. 

4.1.1 Sampling Methodology 

In all sampling algorithms we restarted the Markov chain every T samples. The samples 
from each chain (batch) m are averaged separately: 




t=i 
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The final estimate is obtained as a sample average over M chains: 



1 M 

P{xi\e) = — Pm(xi\e) 



m=l 



Restarting a Markov chain is known to improve the sampling convergence rate. A single 
chain can become "stuck" generating samples from a single high-probability region without 
ever exploring large number of other high-probability tuples. By restarting a Markov chain 
at a different random point, a sampling algorithm can achieve a better coverage of the 
sampling space. In our experiments, we did not observe any significant difference in the 
estimates obtained from a single chain of size M ■ T or M chains of size T and therefore, 
we only choose to report the results for multiple Markov chains. However, we rely on the 
independence of random values P m (xi\e) to estimate 90% confidence interval for P(xi\e). 

In our implementation of Gibbs sampling schemes, we use zero "burn-in" time (see 
section 2.3.1). As we mentioned earlier, the idea of burn-in time is to throw away the 
first K samples to ensure that the remaining samples are drawn from distribution close 
to target distribution P(X\e). While conservative methods for estimating K through drift 
and minorization conditions were proposed by Rosenthal (1995) and Roberts and Tweedie 
(1999, 2001), the required analysis is beyond the scope of this paper. We consider our 
comparison between Gibbs sampling and cutset sampling, which is the objective, fair in the 
sense that both schemes use K=0. Further, our experimental results showed no positive 
indication that burn-in time would be beneficial. In practice, burn-in is the "pre-processing" 
time used by the algorithm to find the high-probability regions in the distribution P{C\e) 
in case it initially spends disproportionally large period of time in low probability regions. 
Discarding a large number of low-probability tuples obtained initially, the frequency of the 
remaining high-probability tuples is automatically adjusted to better reflect their weight. 
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Figure 7: Comparing loop-cutset sampling MSE vs. number of samples (left) and and num- 
ber of unique samples vs. number of samples (right) in cpcs360b. Results are 
averaged over 10 instances with different observations. 



In our benchmarks, we observed that both full Gibbs sampling and cutset sampling 
were able to find high probability tuples fast relative to the number of samples generated. 
For example, in one of the benchmarks, cpcs360b, the rate of generating unique samples, 
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namely, the ratio of cutset instances that have not been seen to the number of samples, 
decreases over time. Specifically, loop-cutset sampling generates 200 unique tuples after the 
first 1000 samples, an additional 100 unique tuples while generating the next 1000 samples, 
and then the rate of generating unique tuples slows to 50 per 1000 samples in the range 
from 2000 to 10000 samples as shown in Figure 7, right. That means that after the first 
few hundred samples, the algorithm spends most of the time revisiting high-probability 
tuples. In other benchmarks, the number of unique tuple instances generated increases 
linearly (as in cpcs54) and, thus, the tuples appear to be distributed nearly uniformly. In 
this case, there is no need for burn-in because there are no strongly-expressed heavy-weight 
tuples. Instead of using burn-in times, we sample initial variable values from the posterior 
marginal estimates generated by IBP in all of our experiments. Our sampling time includes 
the pre-processing time of IBP. 

All experiments were performed on 1.8 GHz CPU. 

4.1.2 Measures of Performance 

For each problem instance defined by a Bayesian network B having variables X = {X±, X n } 
and evidence E C X, we derived the exact posterior marginals P(Xi\e) using bucket-tree 
elimination (Dechter, 2003, 1999a) and computed the mean square error (MSE) of the 
approximate posterior marginals PpQ|e) for each algorithm where MSE is defined by: 

MSE = ^ — 1 Wix . M E Ei p N e )-^i e )] 2 

l^Xi&X\E I V i)\ Xi eX\EV(Xi) 

While the mean square error is our primary accuracy measure, the results are consistent 
across other well-known measures such as average absolute error, KL-distance, and squared 
Hellinger's distance which we show only for loop-cutset sampling. The absolute error A is 
averaged over all values of all unobserved variables: 

A= V 1 \V(XM ^ £ \P( Xi \e) - P{xi\e)\ 

KL-distance Dk between the distribution P(JQ|e) and the estimator P(Xi\e) is defined as 
follows: 

D K (P(X l \e),P(X l \e)) = £ P( x ,|e)lg£^M 

For each benchmark instance, we compute the KL-distance for each variable Xi £ X\E and 
then average the results: 

Dk{P ' P)= \xXe\ £ ^(P(^|e),PpQ|e)) 

' ^ ' Xi£X\E 

The squared Hellinger's distance Dh between the distribution P(X,i\e) and the estimator 
P(Xi\e) is obtained as: 

D H (P{X i \e),P(X l \e))= £ y P{ Xl \e) - ^ P{ Xl \e)) 2 

V(Xi) 
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The average squared Hellinger's distance for a benchmark instance is the average of the 
distances between posterior distributions of one variable: 

' ^ ' Xi<=X\E 

The average errors for different network instances are then averaged over all instances of 
the given network (typically, 20 instances). 

We also report the confidence interval for the estimate P(xi\e) using an approach similar 
to the well-known batch means method (Billingsley, 1968; Geyer, 1992; Steiger & Wilson, 
2001). Since chains are restarted independently, the estimates P m {xi\e) are independent. 
Thus, the confidence interval can be obtained by measuring the variance in the estimators 
P(Xi\e). We report results in Section 4.5. 

4.2 Benchmarks 

We experimented with four classes of networks: 

CPCS. We considered four CPCS networks derived from the Computer-based Patient 
Case Simulation system (Parker & Miller, 1987; Pradhan, Provan, Middleton, h, Henrion, 
1994). CPCS network representation is based on INTERNIST 1 (Miller, Pople, & Myers, 
1982) and Quick Medical Reference (QMR) (Miller, Masarie, & Myers, 1986) expert sys- 
tems. The nodes in CPCS networks correspond to diseases and findings and conditional 
probabilities describe their correlations. The cpcs54 network consists of iV=54 nodes and 
has a relatively large loop-cutset of size |LC|=16 (> 25% of the nodes). Its induced width 
is 15. cpcsl79 network consists of iV=179 nodes. Its induced width is w*=8. It has a 
small loop-cutset of size |LC|=8 but with a relatively large corresponding adjusted induced 
width wlc=7- The cpcs360b is a larger CPCS network with 360 nodes, adjusted induced 
width of 21, and loop-cutset \LC\=2Q. Exact inference on cpcs360b averaged ~ 30 minutes. 
The largest network, cpcs422b, consisted of 422 nodes with induced width w*=22 and 
loop-cutset of size 47. The exact inference time for cpcs422b is about 50 minutes. 

Hailfinder network. It is a small network with only 56 nodes. The exact inference in 
Hailfinder network is easy since its loop-cutset size is only 5. Yet, this network has some 
zero probabilities and, therefore, is a good benchmark for demonstrating the convergence 
of cutset sampling in contrast to Gibbs sampling. 

Random networks. We experimented with several classes of random networks: ran- 
dom networks, 2-layer networks, and grid networks. The random networks were generated 
with iV=200 binary nodes (domains of size 2). The first 100 nodes, {X±, Xiqo}, were 
designated as root nodes. Each non-root node Xi , i > 100, was assigned 3 parents selected 
randomly from the list of predecessors {X\, We will refer to this class of random 
networks as multi-partite random networks to distinguish from bi-partite (2-layer) random 
networks. The random 2-layer networks were generated with 50 root nodes (first layer) 
and 150 leaf nodes (second layer), yielding a total of 200 nodes. A sample 2-layer random 
network is shown in Figure 8, left. Each non-root node (second layer) was assigned 1-3 
parents selected at random among the root nodes. All nodes were assigned a domain of size 
2,V{X l ) = {xlx}}. 
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Figure 8: Sample random networks: 2-layer (left), grid (center), coding (right). 



For both 2-layer and multi-partite random networks, the root nodes were assigned uni- 
form priors while conditional probabilities were chosen randomly. Namely, each value 
P(x t -\pai) was drawn from uniform distribution over interval (0,1) and used to compute 
the complementary probability value P(x\\pa,i) = 1 — P(x®\pa,i). 

The directed grid networks (as opposed to grid-shaped undirected Markov Random 
Fields) of size 15x30 with 450 nodes were also constructed with uniform priors (on the single 
root node) and random conditional probability tables (as described above). A sample grid 
network is shown in Figure 8, center. Those networks had an average induced width of 
size 20 (exact inference using bucket elimination required about 30 minutes) . They had the 
most regular structure of all and the largest loop-cutset containing nearly a half of all the 
unobserved nodes. 

Coding networks. We experimented with coding networks with 50 code bits and 50 
parity check bits. The parity check matrix was randomized; each parity check bit had three 
parents. A sample coding network with 4 code bits, 4 parity checking bits, and total of 8 
transmitted bits is shown in Figure 8, center. The total number of variables in each network 
in our experiments was 200 (50 code bits, 50 parity check bits, and 1 transmitted bit for 
each code or parity check bit). An average loop-cutset size was 26 and induced width was 
21. The Markov chain produced by Gibbs sampling over the whole coding network is not 
ergodic due to the deterministic parity check function. As a result, Gibbs sampling does 
not converge. However, the Markov chain corresponding to sampling the subspace of coding 
bits only is ergodic and, thus, all of the cutset sampling schemes have converged as we will 
show in the next two sections. 

In all networks, except coding and grid networks, evidence nodes were selected at random 
among the leaf nodes (nodes without children). Since a grid network has only one leaf 
node, the evidence in the grid networks was selected at random among all nodes. For each 
benchmark, we report on the chart title the number of nodes in the network N, average 
number of evidence nodes \E\, size of loop-cutset \LC\, and average induced width of the 
input instance denoted w* to distinguish from the induced width w of the network adjusted 
for its ic-cutset. 
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4.3 Results for Loop-Cutset Sampling 

In this section we compare loop-cutset sampling with pure Gibbs sampling, likelihood 
weighting, AIS-BN, and IBP. In all benchmarks, the cutset was selected so that the evidence 
and sampling nodes together constitute a loop-cutset of the network using the algorithm 
proposed by Becker et al. (2000). We show the accuracy of Gibbs and loop-cutset sampling 
as a function of the number of samples and time. 

CPCS networks. The results are summarized in Figures 9-12. The loop-cutset curve 
in each chart is denoted LCS (for Loop Cutset Sampling). The induced width of the network 
wlc when loop-cutset nodes are observed is specified in the caption. It is identical to the 
largest family size in the poly-tree generated when cutset variables are removed. We plot 
the time on the x-axis and the accuracy (MSE) on the y-axis. In the CPCS networks, IBP 
always converged and converged fast (within seconds). Consequently, IBP curve is always 
a straight horizontal line as the results do not change after the convergence is achieved. 
The curves corresponding to Gibbs sampling, loop-cutset sampling, likelihood weighting, 
and AIS-BN demonstrate the convergence of the sampling schemes with time. In the three 
CPCS networks loop-cutset sampling converges much faster than Gibbs sampling. The only 
exception is cpcs422b (Figure 12, right) where the induced width of the conditioned singly- 
connected network remains high (wlc = 14) due to large family sizes and thus, loop-cutset 
sampling generates samples very slowly (4 samples/second) compared to Gibbs sampling 
(300 samples/second). Since computing sampling distribution is exponential in w, sampling 
a single variable is 0(2 14 ) (all variables have domains of size 2). As a result, although loop- 
cutset sampling shows a significant reduction in MSE as a function of the number of samples 
(Figure 12, left), it is not enough to compensate for the two orders of magnitude difference 
in the loop-cutset rate of sample generation. For cpcs54 (Figure 9), cpcsl79 (Figure 10), 
and cpcs360b (Figure 11) loop-cutset sampling achieves greater accuracy than IBP within 
10 seconds or less. 

In comparison with importance sampling schemes, we observe that the AIS-BN algo- 
rithm consistently outperforms likelihood weighting and AIS-BN is slightly better than loop- 
cutset sampling in cpcs54, where the probability of evidence P(e)=0.0928 is relatively high. 
In cpcsl79, where probability of evidence P(e)=4E-05 is smaller, LCS outperforms AIS-BN 
while Gibbs sampling curves falls in between AIS-BN and likelihood weighting. Both Gibbs 
sampling and loop-cutset sampling outperform AIS-BN in cpcs360b and cpcs422b where 
probability of evidence is small. In cpcs360b average P(e)=5E-8 and in cpcs422b the prob- 
ability of evidence varies from 4E-17 to 8E-47. Note that likelihood weighting and AIS-BN 
performed considerably worse than either Gibbs sampling or loop-cutset sampling in all of 
those benchmarks as a function of the number of samples. Consequently, we left them off 
the charts showing the convergence of Gibbs and loop-cutset sampling as a function of the 
number of samples in order to zoom in on the two algorithms which are the focus of the 
empirical studies. 

Coding Networks. The results for coding networks are shown in Figure 13. We 
computed error measures over all coding bits and averaged over 100 instances (10 instances, 
with different observed values, of each of the 10 networks with different coding matrices). As 
we noted earlier, the Markov chains corresponding to Gibbs sampling over coding networks 
are not ergodic and, as a result, Gibbs sampling does not converge. However, the Markov 
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cpcs54, N=54, |LC|=16, w*=15, |E|=8 
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Figure 9: Comparing loop-cutset sampling (LCS), wlc=5, Gibbs sampling (hereby referred 
to as Gibbs), likelihood weighting (LW), AIS-BN, and IBP on cpcs54 network, 
averaged over 20 instances, showing MSE as a function of the number of samples 
(top left) and time (top right) and KL-distance (middle left), squared Hellinger's 
distance (middle right), and an average error (bottom) as a function of time. 
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Figure 10: Comparing loop-cutset sampling (LCS), wlc=7, Gibbs sampling, likelihood 
weighting (LW), AIS-BN, and IBP on cpcsl79 network, averaged over 20 in- 
stances, showing MSE as a function of the number of samples (top left) and time 
(top right) and KL-distance (middle left), squared Hellinger's distance (middle 
right), and an average error (bottom) as a function of time. 
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cpcs360b, N=360, |LC|=26, w*=21, |E|=15 
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Figure 11: Comparing loop-cutset sampling (LCS), u>lc = 3, Gibbs sampling, likelihood 
weighting (LW), AIS-BN, and IBP on cpcs360b network, averaged over 20 in- 
stances, showing MSE as a function of the number of samples (top left) and time 
(top right) and KL-distance (middle left), squared Hellinger's distance (middle 
right), and an average error (bottom) as a function of time. 
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Figure 12: Comparing loop-cutset sampling (LCS), wlc=^i Gibbs sampling, likelihood 
weighting (LW), AIS-BN sampling, and IBP on cpcs422b network, averaged 
over 10 instances, showing MSE as a function of the number of samples (top 
left) and time (top right) and KL-distance (middle left), squared Hellinger's 
distance (middle right), and an average error (bottom) as a function of time. 
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Figure 13: Comparing loop-cutset sampling (LCS), wlc=3, Gibbs sampling, likelihood 
weighting (LW), AIS-BN, and IBP on coding networks, <r=0.4, averaged over 
10 instances of 10 coding networks (100 instances total). The graphs show aver- 
age absolute error ( top left), MSE (top right), KL-distance (bottom left), and 
squared Hellinger's distance (bottom right) as a function of time. 



chain corresponding to sampling the subspace of code bits only is ergodic and therefore, 
loop-cutset sampling, which samples a subset of coding bits, converges and even achieves 
higher accuracy than IBP with time. In reality, IBP is certainly preferable for coding 
networks since the size of the loop-cutset grows linearly with the number of code bits. 

Random networks. In random multi-part networks (Figure 14, top) and random 
2-layer networks (Figure 14, middle), loop-cutset sampling always converged faster than 
Gibbs sampling. The results are averaged over 10 instances of each network type. In 
both cases, loop-cutset sampling achieved accuracy of IBP in 2 seconds or less. In 2-layer 
networks, Iterative Belief Propagation performed particularly poorly. Both Gibbs sampling 
and loop-cutset sampling obtained more accurate results in less than a second. 

Hailflnder network. We used this network (in addition to coding networks) to com- 
pare the behavior of cutset sampling and Gibbs sampling in deterministic networks. Since 
Hailflnder network contains many deterministic probabilities, the Markov chain correspond- 
ing to Gibbs sampling over all variables is non-ergodic. As expected, Gibbs sampling fails 
while loop-cutset sampling computes more accurate marginals (Figure 15). 
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Figure 14: Comparing loop-cutset sampling (LCS), Gibbs sampling, and IBP on random 
networks (left) and 2-layer random networks (right), wlc=3 in both classes of 
networks, averaged over 10 instances each. MSE as a function of time. 
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Figure 15: Comparing loop-cutset sampling (LCS), u>lc=7, Gibbs sampling, and IBP on 
Hailfinder network, 10 instances. MSE as a function of time. 



In summary, the empirical results demonstrate that loop-cutset sampling is cost-effective 
time-wise and superior to Gibbs sampling. We measured the ratio R = j^- of the number 
of samples M g generated by Gibbs to the number of samples M c generated by loop-cutset 
sampling in the same time period (it is relatively constant for any given network and only 
changes slightly between problem instances that differ with observations). For cpcs54, 
cpcsl79, cpcs360b, and cpcs422b the ratios were correspondingly 2.5, 3.75, 0.7, and 75 
(see Table 2 in section 4.4). We also obtained R=2.0 for random networks and R=0.3 for 
random 2-layer networks. The ratio values > 1 indicate that the Gibbs sampler generates 
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samples faster than loop-cutset sampling which is usually the case. In those instances, 
variance reduction compensated for the increased computation time because fewer samples 
are needed to converge resulting in the overall better performance of loop-cutset sampling 
compared to Gibbs sampling. In some cases, however, the reduction in the sample size 
also compensates for the overhead computation in the sampling of one variable value. In 
such cases, loop-cutset sampling generated samples faster than Gibbs yielding ratio R < 1. 
Then, the improvement in the accuracy is due both to larger number of samples and to 
faster convergence. 

4.4 ^-Cutset Sampling 

In this section, we compare the general w-cutset scheme for different values of w against 
Gibbs sampling. The main goal is to study how the performance of w-cutset sampling 
varies with w. For completeness sake, we include results of loop-cutset sampling shown in 
section 4.3. 

In this empirical study, we used the greedy algorithm for set cover problem, mentioned 
in section 3.5, for finding minimal tu-cutset. We apply the algorithm in such a manner that 
each (w + l)-cutset is a proper subset of a tt;-cutset and, thus, can be expected to have a 
lower variance and converge faster than sampling on w-cutset in terms of number of samples 
required (following the Rao-Blackwellisation theory). We focus the empirical study on the 
trade-offs between cutset size reduction and the associated increase in sample generation 
time as we gradually increase the bound w. 

We used the same benchmarks as before and included also grid networks. All sampling 
algorithms were given a fixed time bound. When sampling small networks, such as cpcs54 
(w*=15) and cpcsl79 (u>*=8), where exact inference is easy, sampling algorithms were 
allocated 10 seconds and 20 seconds respectively. For larger networks we allocated 100-200 
seconds depending on the complexity of the network which was only a fraction of exact 
computation time. 

Table 1 reports the size of the sampling set used by each algorithm where each column 
reports the size of the corresponding w-cutset. For example, for cpcs360b, the average 
size of Gibbs sample (all nodes except evidence) is 345, the loop-cutset size is 26, the size 
of 2-cutset is 22, and so on. Table 2 shows the rate of sample generation by different 
algorithms per second. As we observed previously in the case of loop-cutset sampling, 
in some special cases cutset sampling generated samples faster than Gibbs sampler. For 
example, for cpcs360b, loop-cutset sampling and 2-cutset sampling generated 600 samples 
per second while the Gibbs sampler was able to generate only 400 samples. We attribute 
this to the size of cutset sample (26 nodes or less as reported in Table 1) compared to the 
size of the Gibbs sample (over 300 nodes). 

CPCS networks. We present two charts. One chart demonstrates the convergence 
over time for several values of w. The second chart depicts the change in the quality 
of approximation (MSE) as a function of w for two time points, at the half of the total 
sampling time and at the end of total sampling time. The performance of Gibbs sampling 
and cutset sampling for cpcs54 is shown in Figure 16. The results are averaged over 20 
instances with 5-10 evidence variables. The graph on the left in Figure 16 shows the mean 
square error of the estimated posterior marginals as a function of time for Gibbs sampling, 
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Table 1: Markov chain sampling set size as a function of w. 
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Table 2: Average number of samples generated per second as a function of w. 



loop-cutset sampling, and w-cutset sampling for w=2, 3, 4, and 5. The second chart shows 
accuracy as a function of w. The first point corresponds to Gibbs sampling; other points 
correspond to loop-cutset sampling and u>-cutset sampling with w ranging from 2 to 6. The 
loop-cutset result is embedded with the w-cutset values at w=5. As explained in section 3.3, 
the loop-cutset corresponds to the w-cutset where w is the maximum number of parents in 
the network. Initially, the best results were obtained by 2- and 3-cutset sampling followed 
by the loop-cutset sampling. With time, 2- and 5-cutset sampling become the best. 

The results for cpcsl79 are reported in Figure 17. Both charts show that loop-cutset 
sampling and ty-cutset sampling for w in range from 2 to 5 are superior to Gibbs sampling. 
The chart on the left shows that the best of the cutset sampling schemes, having the lowest 
MSE curves, are 2- and 3-cutset sampling. The loop-cutset curve falls in between 2- and 
3-cutset at first and is outperformed by both 2- and 3-cutset after 12 seconds. Loop-cutset 
sampling and 2- and 3-cutset sampling outperform Gibbs sampling by nearly two orders of 
magnitude as their MSE falls below 1E-04 while Gibbs MSE remains on the order of 1E- 
02. The 4- and 5-cutset sampling results fall in between, achieving the MSE ~lE-03. The 
curves corresponding to loop-cutset sampling and 2-, 3- and 4-cutset sampling fall below 
the IBP line which means that all four algorithms outperform IBP in the first seconds of 
execution (IBP converges in less than a second). The 5-cutset outperforms IBP after 8 
seconds. In Figure 17 on the right, we see the accuracy results for all sampling algorithms 
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Figure 16: MSE as a function of time (left) and w (right) in cpcs54, 20 instances, time 
bound=12 seconds. 
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Figure 17: MSE as a function of time (left) and w (right) in cpcsl79, 20 instances, time 
bound=12 seconds. Y-scale is exponential due to large variation in performance 
of Gibbs and cutset sampling. 



after 10 seconds and 20 seconds. They are in agreement with the convergence curves on the 
left. 

In cpcs360b (Figure 18), loop-cutset sampling and 2- and 3-cutset sampling have similar 
performance. The accuracy of the estimates slowly degrades as w increases. Loop-cutset 
sampling and ^-cutset sampling substantially outperform Gibbs sampling for all values w 
and exceed the accuracy of IBP within 1 minute. 

On the example of cpcs422b, we demonstrate the significance of the adjusted induced 
width of the conditioned network in the performance of cutset sampling. As we reported 
in section 4.3, its loop-cutset is relatively small \LC\=47 but wlc=^ and thus, sampling 
just one new loop-cutset variable value is exponential in the big adjusted induced width. 
As a result, loop-cutset sampling computes only 4 samples per second while the 2-, 3- 
and 4-cutset, which are only slightly larger having 65, 57, and 50 nodes respectively (see 
Table 1), compute samples at rates of 200, 150, and 90 samples per second (see Table 2). 
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Figure 18: MSE as a function of time (left) and w (right) in cpcs360b, 20 instances, time 
bound=60 seconds. Y-scale is exponential due to large variation in performance 
of Gibbs and cutset sampling. 




Figure 19: MSE as a function of time (left) and w (right) in cpcs422b, 10 instances, time 
bound=200 seconds. Y-scale is exponential due to large variation in performance 
of Gibbs and cutset sampling. 



The 5-cutset that is closest to loop-cutset in size, |C 5 1 = 45, computes 50 samples per 
second which is an order of magnitude more than loop-cutset sampling. The results for 
cpcs422b are shown in Figure 19. The loop-cutset sampling results are excluded due to its 
poor performance. The chart on the right in Figure 19 shows that tu-cutset performed well 
in range of w = 2 — 7 and is far superior to Gibbs sampling. When allowed enough time, 
«;-cutset sampling outperformed IBP as well. The IBP converged in 5 seconds. The 2-, 3-, 
and 4-cutset improved over IBP within 30 seconds, and 5-cutset after 50 seconds. 

Random networks. Results from 10 instances of random multi-partite and 10 in- 
stances of 2-layer networks are shown in Figure 20. As we can see, ^-cutset sampling 
substantially improves over Gibbs sampling and IBP reaching optimal performance for 
w = 2 — 3 for both classes of networks. In this range, its performance is similar to that of 
loop-cutset sampling. In case of 2-layer networks, the accuracy of both Gibbs sampling and 
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Figure 20: Random multi-partite networks (top) and 2-layer networks (bottom), 200 nodes, 
10 instances. MSE as a function of the number of samples (left) and w (right). 
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IBP is an order-of-magnitude less compared to cutset sampling (Figure 20, bottom right). 
The poor convergence and accuracy of IBP on 2-layer networks was observed previously 
(Murphy et al., 1999). 




Figure 21: Random networks, 450 nodes, 10 instances. MSE as a function of the number 
of samples (left) and w (right). 



Grid networks. Grid networks having 450 nodes (15x30) were the only class of bench- 
marks where full Gibbs sampling was able to produce estimates comparable to cutset- 
sampling (Figure 21). With respect to accuracy, the Gibbs sampler, loop-cutset sampling, 
and 3-cutset sampling were the best performers and achieved similar results. Loop-cutset 
sampling was the fastest and most accurate among cutset sampling schemes. Still, it gen- 
erated samples about 4 times more slowly compared to Gibbs sampling (Table 2) since 
loop-cutset is relatively large. The accuracy of loop-cutset sampling was closely followed 
by 2-, 3- and 4-cutset sampling slowly degrading as w increased. Grid networks are an 
example of benchmarks with regular graph structure (that cutset sampling cannot exploit 
to its advantage) and small CPTs (in a two-dimensional grid network each node has at most 
2 parents) where Gibbs sampling is strong. 




Figure 22: Coding networks, 50 code bits, 50 parity check bits, <r=0.4, 100 instances, time 
bound=6 minutes. 
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Table 3: Individual Markov chain length as a function of to. The length of each chain M was 
adjusted for each sampling scheme for each benchmark so that the total processing 
time across all sampling algorithms was the same. 



Coding Networks. The cutset sampling results for coding networks are shown in 
Figure 22. Here, the induced width varied from 18 to 22 allowing for exact inference. 
However, we additionally tested and observed that the complexity of the network grows 
exponentially with the number of coding bits (even after a small increase in the number of 
coding bits to 60 yielding a total of 240 nodes after corresponding adjustments to the number 
of parity-checking bits and transmitted code size, the induced width exceeds 24) while the 
time for each sample generation scales up linearly. We collected results for 10 networks 
(10 different parity check matrices) with 10 different evidence instantiations (total of 100 
instances). In decoding, the Bit Error Rate (BER) is a standard error measure. However, 
we computed MSE over all unobserved nodes to evaluate the quality of approximate results 
more precisely. As expected, Gibbs sampling did not converge (because the Markov chain 
was non-ergodic) and was left off the charts. The charts in Figure 22 show that loop-cutset 
is an optimal choice for coding networks whose performance is closely followed by 2-cutset 
sampling. As we saw earlier, cutset sampling outperforms IBP. 

4.5 Computing an Error Bound 

Second to the issue of convergence of sampling scheme is always the problem of predicting 
the quality of the estimates and deciding when to stop sampling. In this section, we compare 
empirically the error intervals for Gibbs and cutset sampling estimates. 

Gibbs sampling and cutset sampling are guaranteed to converge to the correct posterior 
distribution in ergodic networks. However, it is hard to estimate how many samples are 
needed to achieve a certain degree of convergence. It is possible to derive bounds on the 
absolute error based on sample variance for any sampling method if the samples are inde- 
pendent. In Gibbs and other MCMC methods, samples are dependent and we cannot apply 
the confidence interval estimate directly. In case of Gibbs sampling, we can apply the batch 
means method that is a special case of standardized time series method and is used by the 
BUGS software package (Billingsley, 1968; Geyer, 1992; Steiger & Wilson, 2001). 
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The main idea is to "split" a Markov chain of length M ■ T into M chains of length 
T. Let P m (xi\e) be an estimate derived from a single chain m £ [1,...,M] of length T 
(meaning, containing T samples) as defined in equations (28)-(29). The estimates P m (x\e) 
are assumed approximately independent for large enough M. Assuming that convergence 
conditions are satisfied and the central limit theorem holds, the P m {x\e) is distributed 
according to N(E[P(xi\e)], a 2 ) so that the posterior marginal P(JQ|e) is obtained as an 
average of the M results obtained from each chain, namely: 



1 M 

P ^ e ) = m £ Pm{ 

m=l 

and the sampling variance is computed as usually: 



x e 



(33) 



M 



1 M 

— J2(Pm(x\e)-P(x\e)f 



m=l 



An equivalent expression for the sampling variance is: 

£^ =1 P*(*|e)-MP 2 (x|e) 



a 2 = 



M-l 



(34) 



where a is easy to compute incrementally storing only the running sums of P m (x\e) and 
P^(x|e). Therefore, we can compute the confidence interval in the 100(1 — a) percentile 
used for random variables with normal distribution for small sampling set sizes. Namely: 



P(x\e) e [P(x\e)±t<* t{M _ 1) y — 



l-a 



(35) 



where i< 



,(M-1) 



is a table value from t distribution with (M — 1) degrees of freedom. 



We used the batch means approach to estimate the confidence interval in the posterior 
marginals with one modification. Since we were working with relatively small sample sets 
(a few thousand samples) and the notion of "large enough" M is not well defined, we 
restarted the chain after every T samples to guarantee that the estimates P m (x\e) were 
truly independent. The method of batch means only provides meaningful error estimates 
assuming that the samples are drawn from the stationary distribution. We assume that in 
our problems the chains mix fast enough so that the samples are drawn from the target 
distribution. 

We applied this approach to estimate the error bound in the Gibbs sampler and the 
cutset sampler. We have computed a 90% confidence interval for the estimated posterior 
marginal P(xi\e) based on the sampling variance of P m (xi\e) over 20 Markov chains as 
described above. We computed sampling variance a 2 from Eq.(34) and the 90% confidence 
interval Ao.g(xj) from Eq.(35) and averaged over all nodes: 

A .9 = Arv ^ Lvm ^ ^2 A o.d(xi) 

i Xiev(Xi) 



The estimated confidence interval can be too large to be practical. Thus, we compared A0.9 
with the empirical average absolute error A: 
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Table 4: Average absolute error A (measured) and estimated confidence interval A0.9 as a 
function of w over 20 Markov Chains. 



A = AfE,|p W | E E l*W<0-fM«)) 

The objective of this study was to observe whether the computed confidence interval A0.9 
(estimated absolute error) accurately reflects the true absolute error A, namely, to verify 
that A < A0.9, an d if so, then investigate empirically whether confidence interval for cutset- 
sampling estimates will be smaller compared to Gibbs sampling as we would expect due to 
variance reduction. 

Table 4 presents the average confidence interval and average absolute error for our 
benchmarks. For each benchmark, the first row of results (row A) reports the average 
absolute error and the second row of results (row A0.9) reports the 90% confidence interval. 
Each column in Table 4 corresponds to a sampling scheme. The first column reports results 
for Gibbs sampling. The second column reports results for loop-cutset sampling. The 
remaining columns report results for w-cutset sampling for w in range 2 — 6. The loop-cutset 
sampling results for cpcs422b are not included due to statistically insignificant number of 
samples generated by loop-cutset sampling. The Gibbs sampling results for coding networks 
are left out because the network is not ergodic (as mentioned earlier) and Gibbs sampling 
does not converge. 

We can see that for all the networks A < A0.9 which validates our method for measuring 
confidence interval. In most cases the estimated confidence interval A0.9 is no more than 
2-3 times the size of average error A and is relatively small. In case of cutset sampling, the 
largest confidence interval maxAo.9 = 0.00247 is reported in grid networks for loop-cutset 
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sampling. Thus, the confidence interval estimate could be used as a criteria reflecting the 
quality of the posterior marginal estimate by the sampling algorithm in practice. Subse- 
quently, comparing the results for Gibbs sampling and cutset sampling, we observe not 
only a significant reduction in the average absolute error, but also a similar reduction in the 
estimated confidence interval. Across all benchmarks, the estimated confidence interval of 
the Gibbs sampler remains A0.9 > 1E-3. At the same time, for cutset sampling we obtain 
A0.9 < 1E-3 in 5 out of 8 classes of networks (excluded are the cpcsl79, grid, and 2-layer 
networks) . 

4.6 Discussion 

Our empirical evaluation of the performance of cutset sampling demonstrates that, except 
for grid networks, sampling on a cutset usually outperforms Gibbs sampling. We show that 
convergence of cutset sampling in terms of number of samples dramatically improves as 
predicted theoretically. 

The experiments clearly show that there exists a range of ^-values where to-cutset 
sampling outperforms Gibbs sampler. The performance of w-cutset sampling deteriorates 
when increase in w yields only a small reduction in the cutset size. An example is cpcs360b 
network where starting with w=4, increasing w by 1 results in the reducing the sampling 
set by only 1 node (shown in Table 1). 

We observe that the loop-cutset is a good choice of cutset sampling as long as the 
induced width of network wlc conditioned on loop-cutset is reasonably small. When wlc 
is large (as in cpcs422b), loop-cutset sampling is computationally less efficient then iw-cutset 
sampling for w < wlc- 

We also showed in Section 4.3 that both Gibbs sampling and loop-cutset sampling 
outperform the state-of-the-art AIS-BN adaptive importance sampling method when the 
probability of evidence is small. Consequently, all the w-cutset sampling schemes in Sec- 
tion 4.4 that outperformed Gibbs sampler in cpcs360b and cpcs422b would also outperfrom 
AIS-BN. 

5. Related Work 

We mention here some related work. The idea of marginalising out some variables to improve 
efficiency of Gibbs sampling was first proposed by Liu et al. (1994). It was successfully 
applied in several special classes of Bayesian models. Kong et al. (1994) applied collapsing 
to the bivariate Gaussian problem with missing data. Liu (1994) defined a collapsed Gibbs 
sampling algorithm for finding repetitive motifs in biological sequences applies by integrating 
out two parameters from the model. Similarly, Gibbs sampling set is collapsed in Escobar 
(1994), MacEachern (1994), and Liu (1996) for learning the nonparametric Bayes problem. 
In all of the instances above, special relationships between problem variables have been 
exploited to integrate several variables out resulting in a collapsed Gibbs sampling approach. 
Compared to this previous research work, our contribution is in defining a generic scheme 
for collapsing Gibbs sampling in Bayesian networks which takes advantage of the network's 
graph properties and does not depend on the specific form of the relationships between 
variables. 
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Jensen et al. (1995) combined sampling and exact inference in a blocking Gibbs sampling 
scheme. Groups of variables were sampled simultaneously using exact inference to compute 
the needed conditional distributions. Their empirical results demonstrate a significant im- 
provement in the convergence of the Gibbs sampler over time. Yet, in proposed blocking 
Gibbs sampling, the sample contains all variables in the network. In contrast, cutset sam- 
pling reduces the set of variables that are sampled. As noted previously, collapsing produces 
lower variance estimates than blocking and, therefore, cutset sampling should require fewer 
samples to converge. 

A different combination of sampling and exact inference for join-trees was described 
by Roller et al. (1998) and Kjaerulff (1995). oiler et al. and Kjaerulff proposed to sample 
the probability distribution in each cluster for computing the outgoing messages. Kjaerulff 
used Gibbs sampling only for large clusters to estimate the joint probability distribution 
P(Vi),Vi C X in cluster i. The estimated P{Vi) is recorded instead of the true joint 
distribution to conserve memory. The motivation is that only high-probability tuples will 
be recorded while the remaining low-probability tuples are assumed to have probability 0. 
In small clusters, the exact joint distribution P(Vi) is computed and recorded. However, the 
paper does not analyze the introduced errors or compare the performance of this scheme 
with standard Gibbs sampler or the exact algorithm. No analysis of error is given nor 
comparison with other approaches. 

Koller et al. (1998) used sampling used to compute messages sent from cluster i to 
cluster j and the posterior joint distributions in a cluster-tree that contains both discrete 
and continuous variables. This approach subsumes the cluster-based sampling proposed 
by Kjaerulff (1995) and includes rigorous analysis of the error in the estimated posterior 
distributions. The method has difficulties with propagation of evidence. The empirical 
evaluation is limited to two hybrid network instances and compares the quality of the 
estimates to those of likelihood weighting, an instance of importance sampling that does 
not perform well in presence of low-probability evidence. 

The effectiveness of collapsing of sampling set has been demonstrated previously in the 
context of Particle Filtering method for Dynamic Bayesian networks (Doucet, Andrieu, & 
Godsill, 2000a; Doucet, deFreitas, & Gordon, 2001; Doucet, de Freitas, Murphy, & Russell, 
2000b). It was shown that sampling from a subspace combined with exact inference (Rao- 
Blackwellised Particle Filtering) yields a better approximation than Particle Filtering on 
the full set of variables. However, the objective of the study has been limited to observation 
of the effect in special cases where some of the variables can be integrated out easily. Our 
cutset sampling scheme offers a generic approach to collapsing a Gibbs sampler in any 
Bayesian network. 

6. Conclusion 

The paper presents the w-cutset sampling scheme, a general scheme for collapsing Gibbs 
sampler in Bayesian networks. We showed theoretically and empirically that cutset sam- 
pling improves the convergence rate and allows sampling from non-ergodic network that 
has ergodic subspace. By collapsing the sampling set, we reduce the dependence between 
samples by marginalising out some of the highly correlated variables and smoothing the 
sampling distributions of the remaining variables. The estimators obtained by sampling 
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from a lower-dimensional space also have a lower sampling variance. Using the induced 
width w as a controlling parameter, tu-cutset sampling provides a mechanism for balancing 
sampling and exact inference. 

We studied the power of cutset sampling when the sampling set is a loop-cutset and, 
more generally, when the sampling set is a ^-cutset of the network (defined as a subset of 
variables such that, when instantiated, the induced width of the network is < w). Based 
on Rao-Blackwell theorem, cutset sampling requires fewer samples than regular sampling 
for convergence. Our experiments showed that this reduction in number of samples was 
time-wise cost-effective. We confirmed this over a range of randomly generated and real 
benchmarks. We also demonstrated that cutset sampling is superior to the state of the art 
AIS-BN importance sampling algorithm when the probability of evidence is small. 

Since the size of the cutset and the correlations between the variables are two main 
factors contributing to the speed of convergence, w-cutset sampling may be optimized fur- 
ther with the advancement of methods for finding minimal tw-cutset. Another promising 
direction for future research is to incorporate the heuristics for avoiding selecting strongly- 
correlated variables into a cutset since those correlations are driving factors in the speed 
of convergence of Gibbs sampling. Alternatively, we could combine sample collapsing with 
blocking. 

In summary, w-cutset sampling scheme is a simple yet powerful extension of sampling 
in Bayesian networks that is likely to dominate regular sampling for any sampling method. 
While we focused on Gibbs sampling with better convergence characteristics, other sampling 
schemes can be implemented with the cutset sampling principle. In particular, it was 
adapted for use with likelihood weighting (Bidyuk & Dechter, 2006). 
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